2025-05-13-12-09
A Grounded Memory System For Smart Personal Assistants
Abstract
arXiv:2505.06328v1 (announce type: new). A wide variety of agentic AI applications, ranging from cognitive assistants for dementia patients to robotics, demand a robust memory system grounded in reality. In this paper, we propose such a memory system consisting of three components. First, we combine Vision Language Models for image captioning and entity disambiguation with Large Language Models for consistent information extraction during perception. Second, the extracted information is represented in a memory consisting of a knowledge graph enhanced by vector embeddings to efficiently manage relational information. Third, we combine semantic search and graph query generation for question answering via Retrieval Augmented Generation. We illustrate the system's working and potential using a real-world example.
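Summary
Agentic AI applications of all kinds, from cognitive assistance for dementia patients to robotics, need a robust memory system grounded in reality. This paper proposes a memory system made up of three components. First, we combine vision language models (for image captioning and entity disambiguation) with large language models (for keeping information extraction consistent during perception). Second, the extracted information is stored in a memory built on a knowledge graph enhanced with vector embeddings, so that relational information can be managed efficiently. Finally, we integrate semantic search with graph query generation to answer questions through retrieval-augmented generation. A real-world example illustrates how the system works and what it could be used for.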
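The retrieval idea in this entry (semantic search over embeddings, followed by a graph query) can be sketched in a few lines. This is only a toy illustration under stated assumptions: the entities, the three-dimensional embedding vectors, the triples, and the function names are all hypothetical and are not taken from the paper, which does not describe its data model at this level of detail.

```python
import math

# Toy memory: entities with hand-made embedding vectors (hypothetical values)
# and relations stored as (subject, predicate, object) triples.
EMBEDDINGS = {
    "keys": [0.9, 0.1, 0.0],
    "kitchen table": [0.1, 0.9, 0.1],
    "Anna": [0.0, 0.2, 0.9],
}
TRIPLES = [
    ("keys", "located_on", "kitchen table"),
    ("Anna", "placed", "keys"),
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_entity(query_vec):
    """Semantic search step: pick the entity whose embedding is closest."""
    return max(EMBEDDINGS, key=lambda name: cosine(EMBEDDINGS[name], query_vec))

def neighbourhood(entity):
    """Graph query step: collect every triple that mentions the entity."""
    return [t for t in TRIPLES if entity in (t[0], t[2])]

def retrieve_context(query_vec):
    """Return the matched entity and the facts that would be handed to an LLM."""
    entity = nearest_entity(query_vec)
    return entity, neighbourhood(entity)

# The query vector stands in for an embedded question such as "Where are my keys?"
entity, facts = retrieve_context([0.8, 0.2, 0.1])
```

In a retrieval-augmented setup, the returned triples would be placed into the prompt as grounding context; a real system would replace the hand-made vectors with a learned embedding model and the list scan with a graph database query.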
KCluster: An LLM-based Clustering Approach to Knowledge Component Discovery
Abstract
arXiv:2505.06469v1 (announce type: new). Educators evaluate student knowledge using knowledge component (KC) models that map assessment questions to KCs. Still, designing KC models for large question banks remains an insurmountable challenge for instructors who need to analyze each question by hand. The growing use of Generative AI in education is expected only to aggravate this chronic deficiency of expert-designed KC models, as course engineers designing KCs struggle to keep up with the pace at which questions are generated. In this work, we propose KCluster, a novel KC discovery algorithm based on identifying clusters of congruent questions according to a new similarity metric induced by a large language model (LLM). We demonstrate in three datasets that an LLM can create an effective metric of question similarity, which a clustering algorithm can use to create KC models from questions with minimal human effort. Combining the strengths of LLM and clustering, KCluster generates descriptive KC labels and discovers KC models that predict student performance better than the best expert-designed models available. In anticipation of future work, we illustrate how KCluster can reveal insights into difficult KCs and suggest improvements to instruction.
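Summary
Educators assess student knowledge with knowledge component (KC) models, which map assessment questions to KCs. Designing KC models for large question banks, however, remains a hard-to-overcome challenge for instructors who must analyze every question by hand. The growing adoption of generative AI in education is expected only to worsen this long-standing shortage of expert-designed KC models, because course designers struggle to keep pace with the rate at which questions are generated. This work proposes KCluster, a new KC discovery algorithm that identifies clusters of congruent questions using a new similarity metric induced by a large language model (LLM). On three datasets we show that an LLM can produce an effective question-similarity metric, which a clustering algorithm can then use to build KC models from questions with minimal human effort. By combining the strengths of LLMs and clustering, KCluster generates descriptive KC labels and discovers KC models that predict student performance better than the best available expert-designed models. Looking ahead to future work, we show how KCluster can reveal insights into difficult KCs and suggest improvements to instruction.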
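To make the "similarity metric plus clustering" idea concrete, here is a minimal sketch that groups questions from a table of pairwise similarity scores. Everything specific in it is an assumption: the scores are invented, the 0.5 threshold is arbitrary, and threshold-based grouping with union-find is a simple stand-in, not the clustering algorithm or the LLM-induced metric used by KCluster.

```python
# Hypothetical pairwise similarity scores between four questions; in the
# abstract's setting these would come from an LLM-induced metric.
QUESTIONS = ["q1", "q2", "q3", "q4"]
SIMILARITY = {
    ("q1", "q2"): 0.91,
    ("q1", "q3"): 0.12,
    ("q1", "q4"): 0.08,
    ("q2", "q3"): 0.15,
    ("q2", "q4"): 0.10,
    ("q3", "q4"): 0.87,
}

def cluster(questions, similarity, threshold=0.5):
    """Group questions whose similarity is at least the threshold (union-find)."""
    parent = {q: q for q in questions}

    def find(q):
        # Follow parent links to the cluster representative, halving the path.
        while parent[q] != q:
            parent[q] = parent[parent[q]]
            q = parent[q]
        return q

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for q in questions:
        groups.setdefault(find(q), []).append(q)
    return sorted(groups.values())

# Each resulting group plays the role of one candidate knowledge component.
kc_model = cluster(QUESTIONS, SIMILARITY)
```

With these made-up scores, the two high-similarity pairs end up in separate groups, which is the shape of output a KC model needs: a mapping from each question to exactly one component.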
A New DAPO Algorithm for Stock Trading
Abstract
arXiv:2505.06408v1 (announce type: new). Recent advances in reinforcement learning, such as Dynamic Sampling Policy Optimization (DAPO), show strong performance when paired with large language models (LLMs). Motivated by this success, we ask whether similar gains can be realized in financial trading. We design a trading agent that combines an improved Group Relative Policy Optimization (GRPO) algorithm, augmented with ideas from DAPO, with LLM-based risk and sentiment signals extracted from financial news. On the NASDAQ-100 index (FNSPID dataset), our agent attains a cumulative return of 230.49 percent and an information ratio of 0.37, outperforming the CPPO-DeepSeek baseline. It also cuts training time from about 8 hours to 2.5 hours over 100 epochs while markedly reducing RAM usage. The proposed RL-LLM framework offers a scalable path toward data-efficient trading agents. Code: https://github.com/Ruijian-Zha/FinRL-DAPO-SR/
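Summary
Recent advances in reinforcement learning, such as Dynamic Sampling Policy Optimization (DAPO), perform strongly when combined with large language models (LLMs). Motivated by this success, we examine whether similar gains can be achieved in financial trading. We design a trading agent that combines an improved Group Relative Policy Optimization (GRPO) algorithm, which incorporates ideas from DAPO, with LLM-based risk and sentiment signals drawn from financial news. On the NASDAQ-100 index (FNSPID dataset), the agent reaches a cumulative return of 230.49% and an information ratio of 0.37, outperforming the CPPO-DeepSeek baseline. It also shortens training time over 100 epochs from about 8 hours to 2.5 hours and markedly reduces memory usage. The proposed RL-LLM framework offers a scalable path toward data-efficient trading agents.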
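Two quantities named in this entry have standard textbook definitions that are easy to show in code: the group-relative advantage at the core of GRPO, and the information ratio. The sketch below uses those generic definitions only. It does not reproduce the paper's improved GRPO variant, and the sample rewards and returns are invented; whether the paper annualizes its information ratio is not stated in the abstract, so no annualization is applied here.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: standardise each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rewards equal: the group carries no learning signal.
        # Returning zeros is a choice made for this sketch.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def information_ratio(portfolio_returns, benchmark_returns):
    """Mean active return divided by its sample standard deviation (tracking error)."""
    active = [p - b for p, b in zip(portfolio_returns, benchmark_returns)]
    tracking_error = statistics.stdev(active)
    if tracking_error == 0:
        return float("nan")
    return statistics.fmean(active) / tracking_error

# Hypothetical inputs, for illustration only.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
ir = information_ratio([0.02, 0.01, 0.03], [0.01, 0.01, 0.01])
```

The advantages sum to zero by construction, so above-average samples in a group are reinforced and below-average ones are discouraged, without needing a separate value network.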
Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference
Abstract
arXiv:2505.06461v1 (announce type: new). The common assumption in on-device AI is that GPUs, with their superior parallel processing, always provide the best performance for large language model (LLM) inference. In this work, we challenge this notion by empirically demonstrating that, under certain conditions, CPUs can outperform GPUs for LLM inference on mobile devices. Using a 1-billion-parameter LLM deployed via llama.cpp on the iPhone 15 Pro, we show that a CPU-only configuration (two threads, F16 precision) achieves 17 tokens per second, surpassing the 12.8 tokens per second obtained with GPU acceleration. We analyze the architectural factors driving this counterintuitive result, revealing that GPU memory transfer overhead and CPU thread optimization play a critical role. Furthermore, we explore the impact of thread oversubscription, quantization strategies, and hardware constraints, providing new insights into efficient on-device AI execution. Our findings challenge conventional GPU-first thinking, highlighting the untapped potential of optimized CPU inference and paving the way for smarter deployment strategies in mobile AI. However, fully explaining the observed CPU advantage remains difficult due to limited access to low-level profiling tools on iOS.
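Summary
A common assumption in on-device AI is that GPUs, thanks to their superior parallel processing, always deliver the best performance for large language model (LLM) inference. This study challenges that view empirically, finding that under certain conditions CPUs can outperform GPUs for LLM inference on mobile devices. Deploying a 1-billion-parameter LLM through llama.cpp on an iPhone 15 Pro, we show that a CPU-only configuration (two threads, F16 precision) reaches 17 tokens per second, ahead of the 12.8 tokens per second obtained with GPU acceleration. Analyzing the architectural factors behind this counterintuitive result, we find that GPU memory transfer overhead and CPU thread optimization play a key role. We further examine the effects of thread oversubscription, quantization strategies, and hardware constraints, offering new insight into efficient on-device AI execution. These findings challenge conventional GPU-first thinking, reveal the untapped potential of optimized CPU inference, and open new paths for mobile AI deployment strategies. Fully explaining the observed CPU advantage nevertheless remains difficult, because access to low-level profiling tools on iOS is limited.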
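The two throughput figures reported in this entry translate directly into a relative speedup and an average per-token latency. The short calculation below uses only the numbers from the abstract; treating 1/throughput as per-token latency assumes the figures describe steady-state token generation, which the abstract does not spell out.

```python
def latency_ms_per_token(tokens_per_second):
    """Average time spent per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second

cpu_tps = 17.0  # CPU-only, two threads, F16 (from the abstract)
gpu_tps = 12.8  # GPU-accelerated (from the abstract)

speedup = cpu_tps / gpu_tps
cpu_latency_ms = latency_ms_per_token(cpu_tps)
gpu_latency_ms = latency_ms_per_token(gpu_tps)

print(f"CPU vs GPU throughput ratio: {speedup:.2f}x")
print(f"Per-token latency: CPU {cpu_latency_ms:.1f} ms, GPU {gpu_latency_ms:.1f} ms")
```

The ratio works out to roughly 1.33, meaning the CPU-only configuration generates about a third more tokens per second than the GPU-accelerated one in this benchmark.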
Towards Efficient LLM Storage Reduction via Tensor Deduplication and Delta Compression
Abstract
arXiv:2505.06252v1 (announce type: new). Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques, such as deduplication and compression, are either LLM oblivious or not compatible with each other, limiting data reduction effectiveness. Our large-scale characterization study across all publicly available Hugging Face LLM repositories reveals several key insights: (1) fine-tuned models within the same family exhibit highly structured, sparse parameter differences suitable for delta compression; (2) bitwise similarity enables LLM family clustering; and (3) tensor-level deduplication offers strong synergy with model aware compressors. Building on these insights, we present BitX, an effective, fast, lossless delta compression algorithm that compresses XORed redundancy between fine-tuned and base LLMs. We build zLLM, a model storage reduction pipeline that unifies tensor-level deduplication and lossless BitX compression. By synergizing deduplication and compression around LLM family clustering, zLLM reduces model storage consumption by 49.5 percent, over 20 percent more than state-of-the-art deduplication and compression designs.
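Summary
Modern model hubs such as Hugging Face store tens of petabytes of large language models (LLMs), with fine-tuned variants far outnumbering base models and dominating storage consumption. Existing storage reduction techniques, such as deduplication and compression, are either not tailored to LLMs or incompatible with each other, which limits how much data they can eliminate. Our large-scale characterization of all public Hugging Face LLM repositories yields several key findings: (1) fine-tuned models in the same family show highly structured, sparse parameter differences that suit delta compression; (2) bit-level similarity enables LLM family clustering; and (3) tensor-level deduplication has strong synergy with model-aware compressors. Building on these findings, we present BitX, an effective, fast, lossless delta compression algorithm that compresses the XOR redundancy between fine-tuned and base LLMs. We also build zLLM, a storage reduction pipeline that unifies tensor-level deduplication with lossless BitX compression. By coordinating deduplication and compression around LLM family clustering, zLLM cuts model storage consumption by 49.5%, over 20 percent more than the best existing deduplication and compression designs.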
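The two mechanisms this entry combines, XOR-based delta compression against a base model and tensor-level deduplication, can be illustrated with the standard library alone. The sketch below is not BitX or zLLM: it uses `zlib` as a stand-in compressor, SHA-256 content hashes for deduplication, and tiny made-up "tensors" of float32 values. All function and variable names are hypothetical.

```python
import hashlib
import struct
import zlib

def to_bytes(values):
    """Serialise a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(values)}f", *values)

def xor_bytes(a, b):
    """Bitwise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same size")
    return bytes(x ^ y for x, y in zip(a, b))

def compress_delta(base, finetuned):
    """XOR the fine-tuned tensor against its base, then compress the result."""
    return zlib.compress(xor_bytes(base, finetuned))

def decompress_delta(base, blob):
    """Invert compress_delta: XOR is its own inverse, so this is lossless."""
    return xor_bytes(base, zlib.decompress(blob))

def dedup(tensors):
    """Tensor-level deduplication: store each distinct byte payload once."""
    store, index = {}, {}
    for name, payload in tensors.items():
        digest = hashlib.sha256(payload).hexdigest()
        store.setdefault(digest, payload)
        index[name] = digest
    return store, index

# Hypothetical data: a fine-tuned tensor that differs from its base in 10 of 1000 weights.
base = to_bytes([0.5] * 1000)
finetuned = to_bytes([0.5] * 990 + [0.5001] * 10)

blob = compress_delta(base, finetuned)
restored = decompress_delta(base, blob)
store, index = dedup({"layer0": base, "layer1": base, "layer2": finetuned})
```

Because unchanged weights XOR to all-zero bytes, the delta is highly compressible when fine-tuning touches few bits, and the round trip recovers the fine-tuned tensor bit for bit. Deduplication handles the complementary case where whole tensors are identical across models.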